Add payload-size-aware large-value zone eviction for EloqKV (ObjectCcMap)#415
Conversation
- Add txservice_large_value_threshold and txservice_large_value_eviction_age to tx_service_common.h. When the threshold is > 0, entries whose PayloadSize() exceeds it are protected from LRU eviction until the page's LRU age (access_counter_ - last_access_ts_) reaches the eviction age.
- Expose an AccessCounter() getter on CcShard for use by the clean guard.
- Modify CcPageCleanGuardWithoutKickoutCc::CanBeCleaned() to implement the protection logic, cooperating with existing LRU-based eviction.
- Add optional configuration keys large_value_threshold and large_value_eviction_age to TxService::Init().
- Add BulkEmplaceFreeForTest() and SetPayloadForTest() helpers to TemplateCcMap for deterministic test setup.
- Add a test case validating that large-value entries are protected when the policy is active and evicted normally when protection is disabled.

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
- Expand the access_counter_ comment in cc_shard.h to cover both its primary use (O(1) relative-order comparison of two LRU pages during merge/redistribute) and its secondary use (measuring LRU age for the payload-size-aware eviction guard).
- Add a comment to last_access_ts_ in cc_entry.h explaining that it stores the snapshot of access_counter_ at last-access time, and that the difference (access_counter_ - last_access_ts_) is always >= 0 and a valid LRU-age proxy.
- Enrich the comment in CcPageCleanGuardWithoutKickoutCc::CanBeCleaned() to explain the invariant that makes the age calculation correct and why it cooperates naturally with standard LRU eviction.

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
```cpp
    {
        return {false, false};
    }
}
```
This algorithm has two problems:
1. shard.access_counter_ is a shard-level value, so a single hot key can make it grow very large. Users neither know how to configure txservice_large_value_eviction_age nor want to configure it.
2. Large values tend to settle at the head of the LRU list, but because their age stays below txservice_large_value_eviction_age they are never evicted. The LRU scan then does useless work, scanning large values that cannot be kicked out.
Fixed in commit 55925be.
Both problems are resolved by replacing the absolute txservice_large_value_eviction_age threshold with a self-calibrating relative LRU-position check:

```cpp
uint64_t now = cc_shard_->AccessCounter();
uint64_t page_age = now - page_->last_access_ts_;
uint64_t total_span = now - cc_shard_->LruOldestTs();
if (page_age * 2 < total_span)  // page in recent half → protect
    return {false, false};
// page in old half → allow eviction (fall through)
```

Problem 1 (unconfigurable threshold): txservice_large_value_eviction_age is removed entirely. The check compares a page's age against the total span of the LRU list, so it self-calibrates regardless of how fast access_counter_ grows. No user configuration is needed.

Problem 2 (cold large values blocking the scan): A cold large-value page near the LRU head has page_age ≈ total_span, so page_age * 2 >= total_span — it is immediately eligible for eviction rather than wasting scan work.
Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
…e LRU check

Problem 1: access_counter_ grows at unpredictable rates with hot keys, making txservice_large_value_eviction_age impossible to configure.
Problem 2: cold large-value pages near the LRU head accumulate a large absolute age that is still smaller than the threshold, causing useless scan work on entries that cannot be evicted.

Fix: replace the absolute threshold with a self-calibrating relative LRU-position check:

- page_age * 2 < total_span → page in recent half → protect
- page_age * 2 >= total_span → page in old half → evict

where:

- page_age = access_counter_ - page->last_access_ts_
- total_span = access_counter_ - LruOldestTs()

This requires no user configuration and automatically adapts to any access rate. Cold large-value pages (near the head) always satisfy page_age * 2 >= total_span, so they are correctly evicted instead of blocking the scan. Hot pages (near the tail) always satisfy page_age * 2 < total_span, so they are correctly protected.

- Add CcShard::LruOldestTs() returning last_access_ts_ of the LRU-head page.
- Remove the txservice_large_value_eviction_age global and its config parsing.
- Update comments in cc_shard.h, cc_entry.h, cc_page_clean_guard.h.
- Update test: use two maps (old vs new) to verify old-half entries are evicted while new-half entries are retained under the relative check.

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
The previous relative-position check (page_age * 2 < total_span) was fundamentally flawed: when a large-value page drifts to the LRU head it IS the oldest page, so total_span == page_age and page_age * 2 >= total_span is always true — the page was always evictable at the head.

New approach: LRU boost on protection.

- CanBeCleaned (WithoutKickoutCc) protects any free entry whose PayloadSize() > txservice_large_value_threshold and sets a new mutable flag has_blocked_large_value_ in the guard.
- CleanPage propagates the flag via a new out_has_blocked_large_value parameter.
- CleanPageAndReBalance: after RebalancePage captures next_page, if the flag is set and the page is non-empty and still in the LRU list, call UpdateLruList(page, false) to move it to the tail. After the boost page->lru_next_ == &tail_ccp_, so next_page = page->lru_next_ naturally stops the current scan.

Effect:

- Large-value pages are never evicted by the regular scan (they keep getting boosted whenever the scan visits them).
- Small-value pages at the head are evicted normally (the scan progresses past them before hitting the first large-value page).
- The scan never loops: after boosting a page and stopping, the next scan starts from the head, which no longer contains that large-value page (it was just moved to the tail).
- No user-visible configuration change: txservice_large_value_threshold is still the only knob.

Removes LruOldestTs() (no longer needed) from cc_shard.h.
Restores the single-map test (no more two-map approach).

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
The system uses the mimalloc heap for memory statistics and pages are not fixed size, so a fixed-capacity ratio (large_value_zone_max_ratio) based on page count does not map to actual memory usage and is removed.

The zone-separation approach is retained unchanged:

    head ← [small-value pages] ← lru_large_value_zone_head_ ← [large-value pages] ← tail

Small-value pages are inserted before lru_large_value_zone_head_. Large-value pages are inserted before &tail_ccp_ (the true MRU end). In the extreme case where all entries are large-value, the SV zone will be empty and SV pages will be evicted immediately even if accessed frequently — this is explicitly accepted.

Removed:
- txservice_large_value_zone_max_ratio global (tx_service_common.h)
- bool in_large_value_zone_ from LruPage (cc_entry.h)
- uint64_t large_value_zone_page_count_ / total_lru_page_count_ from CcShard (cc_shard.h)
- LargeValueZonePageCount() / TotalLruPageCount() getters (cc_shard.h)
- Count maintenance (++/-- in UpdateLruList, DetachLru) (cc_shard.cpp)
- in_large_value_zone_ inheritance in ReplaceLru (cc_shard.cpp)
- Ratio-enforcement while loop from UpdateLruList (cc_shard.cpp)
- PART 2 ratio-enforcement test section (CcPage-Test.cpp)
- References to removed globals/methods in the test (CcPage-Test.cpp)

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
… sentinel
Evaluation: dummy sentinel is strictly better than the pointer approach.
Pointer approach (old):
- DetachLru must advance lru_large_value_zone_head_ when the zone-head
page is removed (3 extra lines).
- ReplaceLru must update lru_large_value_zone_head_ when the zone-head
page is replaced by defrag (3 extra lines).
- UpdateLruList must initialize lru_large_value_zone_head_ on first LV
page insertion and handle the sentinel-vs-data-page duality
(lru_large_value_zone_head_ == &tail_ccp_ for empty zone, data page
otherwise) (3 extra lines + conceptual complexity).
Dummy sentinel approach (new):
- head_large_ccp_ is always in the list as the permanent boundary:
head_ccp_ ← [SV pages] ← head_large_ccp_ ← [LV pages] ← tail_ccp_
- DetachLru: no zone-boundary update needed (sentinel never moves).
- ReplaceLru: no zone-boundary update needed (sentinel never moves).
- UpdateLruList: insert_before = has_large_value_ ? &tail_ccp_
: &head_large_ccp_. No empty-zone check needed.
- Clean(): skip sentinel pages (parent_map_ == nullptr) in the scan
loop (4 extra lines) — a small, clean addition consistent with how
the loop already skips head_ccp_ (via start-offset) and tail_ccp_
(via loop condition).
- Consistent with head_ccp_ / tail_ccp_ design.
Renamed: lru_large_value_zone_head_ → head_large_ccp_
LruLargeValueZoneHead() returns &head_large_ccp_ (always valid,
parent_map_ == nullptr indicating it is a sentinel).
Updated test: zone_head → zone_sentinel with parent_map_ == nullptr
check; zone-structure walk split into SV walk (up to sentinel) and LV
walk (after sentinel); added assertion that newly inserted SV page lands
immediately before the sentinel.
Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
After comparing the two approaches, the pointer-based zone boundary is preferred over the dummy-sentinel approach (head_large_ccp_). Restores cc_shard.h, cc_shard.cpp, and CcPage-Test.cpp to the state they had before the sentinel commit (991a12f), using:

    git checkout 42d757e~1 -- <files>

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
…et sites

Feasibility evaluation: FEASIBLE. All 7 payload-assignment paths already have the LruPage* in scope and execute on the single-threaded shard coroutine, so there are no concurrency or scoping issues.

New private helper MaybeMarkAndRezoneAsLargeValue(LruPage*, size_t):
- No-op when txservice_large_value_threshold == 0 (disabled), the page is null, has_large_value_ is already set, or payload_size <= threshold.
- Otherwise sets page->has_large_value_ = true and calls UpdateLruList to move the page into the large-value zone immediately.

Called eagerly at all 7 payload-assignment sites:
1. PostWriteCc — after SetCurrentPayload / DeserializeCurrentPayload
2. AcquireAllCc — after SetCurrentPayload(payload)
3. ReadCc (outside) — after PassInCurrentPayload
4. RemoteReadOutside — after DeserializeCurrentPayload (Normal path)
5. ReplayLogCc — after DeserializeCurrentPayload (Insert/Update)
6. UploadBatchCc — after SetCurrentPayload(commit_val) for Normal
7. BackFill — after DeserializeCurrentPayload (non-Deleted path)

The lazy fallback in CanBeCleaned (has_blocked_large_value_ / needs_rezoning) is retained as a safety net for any path not listed above.

Also:
- TriggerEagerRezoneForTest() public test helper: invokes MaybeMarkAndRezoneAsLargeValue for every page in the map using its actual max entry PayloadSize(). Simulates what production commit / backfill paths do without going through Execute handlers.
- New test case "Eager re-zone on large-value payload": inserts two maps (SV + LV), assigns large payloads, calls TriggerEagerRezoneForTest WITHOUT any clean scan, then asserts the correct zone structure. Confirms the eager path fires before CanBeCleaned is ever invoked.

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
The payload-size-aware large-value zone policy is only intended for
EloqKV (ObjectCcMap). It must not affect EloqSQL / EloqDoc tables
served by RangeCcMap, CatalogCcMap, or ClusterConfigCcMap.
Add virtual bool CcMap::IsLargeValueZoneEnabled() const { return false; }.
Default: disabled for every CcMap subclass.
Override in ObjectCcMap: return txservice_large_value_threshold > 0.
Only ObjectCcMap / EloqKV enables the zone.
Update MaybeMarkAndRezoneAsLargeValue in TemplateCcMap:
replace: txservice_large_value_threshold == 0
with: !IsLargeValueZoneEnabled()
Virtual dispatch on 'this' → no-op for RangeCcMap instances.
Update CcPageCleanGuardWithoutKickoutCc::CanBeCleaned:
replace: txservice_large_value_threshold > 0
with: page_->parent_map_ != nullptr &&
page_->parent_map_->IsLargeValueZoneEnabled()
Virtual dispatch via parent_map_ → no-op for non-ObjectCcMap pages.
Update CcPage-Test.cpp:
The tests use TemplateCcMap<..., true, true> (not ObjectCcMap).
Add LargeValueTestCcMap thin test subclass that overrides
IsLargeValueZoneEnabled() to return txservice_large_value_threshold > 0.
Use it in the two large-value test cases.
Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
EloqKV uses ApplyCc (not ReadCc/AcquireCc) to read and write ObjectCcMap. The previous implementation placed MaybeMarkAndRezoneAsLargeValue calls in TemplateCcMap's Execute overloads (PostWriteCc, AcquireAllCc, ReadCc, ReplayLogCc, UploadBatchCc, BackFill). However, ObjectCcMap overrides all of these, so those calls were dead code for EloqKV.

Remove all 7 MaybeMarkAndRezoneAsLargeValue calls from TemplateCcMap:
- Dead code for EloqKV (ObjectCcMap overrides every method they were in)
- No-ops for EloqSQL/EloqDoc (IsLargeValueZoneEnabled() returns false)

Add MaybeMarkAndRezoneAsLargeValue at 6 correct payload-set sites in ObjectCcMap:
1. Execute(ApplyCc): apply_and_commit_ path after payload commit (status == Normal)
2. Execute(PostWriteCc): after dirty/pending-command commit when commit_ts > 0
3. Execute(UploadBatchCc): after PassInCurrentPayload for Normal records (migration)
4. Execute(KeyObjectStandbyForwardCc): after commands committed to payload (standby)
5. Execute(ReplayLogCc): after EmplaceAndCommitBufferedTxnCommand, payload == Normal
6. BackFill: after DeserializeCurrentPayload for Normal records (KV load path)

Update the MaybeMarkAndRezoneAsLargeValue doc comment to list the correct sites.

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
EloqKV holds large values that are expensive to reload from disk. The standard LRU eviction policy does not account for payload size, causing large-value pages to drift to the eviction head at the same rate as cheap small-value entries.
Design
The LRU list is divided into two zones at a stable pointer boundary:
- Small-value pages insert before `lru_large_value_zone_head_` — they cycle through the LRU normally.
- Large-value pages (`PayloadSize() > txservice_large_value_threshold`) insert before `tail_ccp_` — they always stay in the recent half regardless of how many small-value accesses occur.

Key changes
- `CcShard` (`cc_shard.h` / `cc_shard.cpp`)
  - `LruPage *lru_large_value_zone_head_` (init `&tail_ccp_`).
  - `UpdateLruList`: insert point is `&tail_ccp_` for LV pages, `lru_large_value_zone_head_` for SV pages; advance the boundary when the first LV page is inserted.
  - `DetachLru` / `ReplaceLru`: advance/update `lru_large_value_zone_head_` when the boundary page is removed/replaced.
- `LruPage` (`cc_entry.h`)
  - `bool has_large_value_{false}` — marks a page as belonging to the LV zone; set once, never cleared.
- `CcMap` / `ObjectCcMap` (`cc_map.h`, `object_cc_map.h`)
  - `virtual bool IsLargeValueZoneEnabled() const { return false; }` — default off for all maps.
  - `ObjectCcMap` overrides to `return txservice_large_value_threshold > 0` — the policy is EloqKV-only; `RangeCcMap` / `CatalogCcMap` are unaffected.
- `TemplateCcMap` (`template_cc_map.h`)
  - `MaybeMarkAndRezoneAsLargeValue(LruPage*, size_t)` private helper: sets `has_large_value_` and calls `UpdateLruList` to eagerly move the page into the LV zone on first detection.
- `ObjectCcMap` (`object_cc_map.h`) — 6 eager re-zone call sites
  - `Execute(ApplyCc)` — `apply_and_commit_` path after payload commit
  - `Execute(PostWriteCc)` — after commit when `commit_ts > 0`
  - `Execute(UploadBatchCc)` — after `PassInCurrentPayload` for Normal records
  - `Execute(KeyObjectStandbyForwardCc)` — after commands committed to payload
  - `Execute(ReplayLogCc)` — after `EmplaceAndCommitBufferedTxnCommand`
  - `BackFill` — after `DeserializeCurrentPayload` for Normal records
- `CcPageCleanGuardWithoutKickoutCc` (`cc_page_clean_guard.h`) — lazy fallback
  - Sets `has_large_value_` and signals re-zone for any LV entry not yet detected by the eager path (safety net only).
- `tx_service_common.h`
  - `inline size_t txservice_large_value_threshold = 0` — threshold in bytes; `0` disables the policy entirely.
- `tx_service.h`
  - `large_value_threshold` parsed into `txservice_large_value_threshold`.